Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
Abstract
The choice of an imputation method for missing data becomes especially important when the fraction of missing values is large and the data are of mixed type. The proposed dynamic clustering imputation (DCI) algorithm relies on similarity information from shared neighbors, where mixed-type variables are considered together. When evaluated on a public social science dataset of 46,043 mixed-type instances with up to 33% missing values, DCI improved imputation accuracy by more than 20% over Multiple Imputation, Predictive Mean Matching, Linear and Multilevel Regression, and Mean Mode Replacement methods. Data imputed by the six methods were then used in prediction tests with NB-Tree, Random Subset Selection, and Neural Network-based classification models. In our experiments, classification accuracy obtained using DCI-preprocessed data was much better than when relying on the alternative imputation methods for data preprocessing.
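For orientation, the simplest baseline named in the abstract, Mean Mode Replacement, can be sketched as follows. This is a minimal illustration and not the DCI algorithm or the authors' code. It assumes missing entries are encoded as `None` and that the caller states which columns are numeric; the function name `mean_mode_impute` and the toy rows are hypothetical.

```python
from statistics import mean, mode


def mean_mode_impute(rows, numeric_cols):
    """Return a copy of rows with None entries filled in.

    Numeric columns (indices in numeric_cols) get the mean of their observed
    values; all other columns get the most frequent observed value (mode).
    A column with no observed values is left as None.
    """
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            fill.append(None)            # nothing to estimate from
        elif j in numeric_cols:
            fill.append(mean(observed))  # numeric: column mean
        else:
            fill.append(mode(observed))  # categorical: column mode
    # Build new rows so the input data is not modified.
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Hypothetical mixed-type records: (age, residence, household size).
rows = [
    [30, "urban", 2],
    [None, "rural", 4],
    [50, None, None],
    [40, "urban", 3],
]
imputed = mean_mode_impute(rows, numeric_cols={0, 2})
```

Because every missing entry in a column receives the same value, this baseline ignores similarity between instances, which is the information the abstract says DCI draws on.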
Related resources
Performance evaluation of different estimation methods for missing rainfall data
There are numerous methods for estimating missing values, and which ones are used depends on the data type and regional climatic characteristics. In this research, part of the monthly precipitation data from the Sarab synoptic station, East Azerbaijan province, Iran, was randomly selected and treated as missing. In order to study the effectiveness of various methods to estimate missing data, by seven classic s...
MixAll: Clustering Mixed data with Missing Values
The Clustering project is a part of the STK++ library (Iovleff 2012) that can be accessed from R (R Development Core Team 2013) using the MixAll package. It is possible to cluster Gaussian, gamma, categorical, Poisson, kernel mixture models or a combination of these models in case of mixed data. Moreover, if there are missing values in the original data set, these missing values will be imputed ...
Dealing with Incomplete Data in Clustering
Over the years, significant developments have taken place in the direction of clustering numeric, categorical or mixed data. A new challenge is to cluster data with missing attribute values. The early algorithms used Fuzzy c-means to partition data into fuzzy clusters and estimate the missing values through estimation algorithms. Recently, Hathaway and Bezdek have proposed four strategies for e...
Missing value estimation methods for DNA microarrays
MOTIVATION: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values....
Application of Soft Computing Methods for the Estimation of Roadheader Performance from Schmidt Hammer Rebound Values
Estimation of roadheader performance is one of the main topics in determining the economics of underground excavation projects. Poor performance estimation of roadheaders can lead to costly contractual claims. In this paper, the application of soft computing methods for data analysis called adaptive neuro-fuzzy inference system-subtractive clustering method (ANFIS-SCM) and artificial neu...